[PerfXLab] optimize sqrt op performance#2217

Open
bin913 wants to merge 1 commit into flagos-ai:master from bin913:sqrt

Conversation

Contributor

@bin913 bin913 commented Apr 2, 2026

PR Category

[ Operator]

Type of Change

[Performance Optimization]

Description

Optimize sqrt op performance.

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

test_unary_pointwise_perf.py::test_general_unary_pointwise_perf[sqrt-sqrt-dtypes11] 
Operator: sqrt  Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               1.423072            1.410560               1.009               0.761          [torch.Size([1073741824])]
SUCCESS               0.006176            0.005440               1.135               0.001          [torch.Size([64, 64])]
SUCCESS               0.028768            0.028000               1.027               0.599          [torch.Size([4096, 4096])]
SUCCESS               0.029184            0.027920               1.045               0.601          [torch.Size([64, 512, 512])]
SUCCESS               1.421040            1.411200               1.007               0.761          [torch.Size([1024, 1024, 1024])]


Operator: sqrt  Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               2.826640            2.824240               1.001               0.380          [torch.Size([1073741824])]
SUCCESS               0.006208            0.005504               1.128               0.001          [torch.Size([64, 64])]
SUCCESS               0.050784            0.050144               1.013               0.335          [torch.Size([4096, 4096])]
SUCCESS               0.050720            0.050080               1.013               0.335          [torch.Size([64, 512, 512])]
SUCCESS               2.828192            2.825664               1.001               0.380          [torch.Size([1024, 1024, 1024])]


Operator: sqrt  Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               1.427072            1.411952               1.011               0.760          [torch.Size([1073741824])]
SUCCESS               0.006176            0.005408               1.142               0.001          [torch.Size([64, 64])]
SUCCESS               0.028768            0.027808               1.035               0.603          [torch.Size([4096, 4096])]
SUCCESS               0.029216            0.027840               1.049               0.603          [torch.Size([64, 512, 512])]
SUCCESS               1.422864            1.411168               1.008               0.761          [torch.Size([1024, 1024, 1024])]
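The Speedup and TFLOPS columns above can be cross-checked by hand. Speedup is the ratio of Torch latency to Gems latency; TFLOPS appears to be computed as roughly one FLOP per element divided by the Gems latency (that per-element cost is an inference from the numbers, not something stated in the PR):

```python
# Cross-check the first float16 row: 1073741824 elements,
# Torch 1.423072 ms vs Gems 1.410560 ms.
n_elements = 1073741824
torch_ms, gems_ms = 1.423072, 1.410560

# Speedup is simply the latency ratio.
speedup = torch_ms / gems_ms
print(round(speedup, 3))  # 1.009, matching the Speedup column

# Assuming sqrt is counted as 1 FLOP per element:
tflops = n_elements / (gems_ms * 1e-3) / 1e12
print(round(tflops, 3))  # 0.761, matching the TFLOPS column
```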

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 2048}, num_stages=4, num_warps=1),
    ],
    key=["n_elements"],
)
Collaborator

Please transfer the autotune optimization configurations to src/flag_gems/runtime/backend/_nvidia/hopper/tune_configs.yaml.
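The reviewer asks for the tuning parameters to live in a per-backend YAML file rather than being hard-coded in the operator. The actual schema of tune_configs.yaml is not shown in this thread, so the following is only a rough, hypothetical sketch of how such an entry could be expressed (the key names here are assumptions for illustration, not the file's real format):

```yaml
# Hypothetical layout -- key names are illustrative, not the actual schema
# of src/flag_gems/runtime/backend/_nvidia/hopper/tune_configs.yaml.
sqrt:
  - META:
      BLOCK_SIZE: 2048
    num_stages: 4
    num_warps: 1
```

Centralizing configs this way lets each backend (e.g. Hopper) ship its own tuned block sizes without touching the kernel source.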


3 participants